Not Just Bigger: Towards Better-Quality Web Corpora

Authors

  • Yannick Versley
  • Yana Panchenko
Abstract

For acquiring common-sense knowledge, as well as for answering linguistic questions about actual language usage, the breadth and depth of the World Wide Web has been welcomed as a resource to supplement large text corpora (usually from newspapers). While purists’ criticism of unbalanced composition or text quality is easily shrugged off as unconstructive, empirical results on some real-world tasks have found Web corpora to be less useful than (smaller) newspaper corpora. More than the early criticism, evidence that Web corpora are doing poorly at their original purpose should raise concerns about their quality. Especially for non-English Web corpora, principled quality assessment and targeted improvements are instrumental in ensuring their relevance. In this paper, we present our own pipeline for Web corpora, which includes improvements regarding content-sensitive boilerplate detection as well as language filtering for mixed-language documents. We also provide a principled evaluation of combinations of corpora and (non-linguistic and linguistic) preprocessing, comparing more standard types of large corpora (newspaper and Wikipedia) with different Web corpora. While our current results focus on German-language Web corpora, both the content-sensitive boilerplate detection and our method of evaluation by constructing an artificial thesaurus from a wordnet are applicable to many other languages.
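The abstract does not spell out how its language filtering for mixed-language documents works; as a minimal illustrative sketch only, a paragraph-level filter can be approximated with a stopword-ratio heuristic. The word lists and the 0.08 threshold below are assumptions made for this sketch, not the paper's actual method:

```python
# Illustrative paragraph-level language filter for mixed-language documents.
# The stopword lists and the threshold are assumptions for this sketch,
# not the method described in the paper.
GERMAN_STOPS = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"}
ENGLISH_STOPS = {"the", "and", "is", "not", "a", "an", "with", "for", "of", "to"}

def stop_ratio(paragraph, stops):
    """Fraction of tokens in the paragraph that are known stopwords."""
    tokens = paragraph.lower().split()
    if not tokens:
        return 0.0
    return sum(t in stops for t in tokens) / len(tokens)

def keep_german_paragraphs(paragraphs, threshold=0.08):
    """Keep paragraphs that look more German than English."""
    return [p for p in paragraphs
            if stop_ratio(p, GERMAN_STOPS) >= threshold
            and stop_ratio(p, GERMAN_STOPS) > stop_ratio(p, ENGLISH_STOPS)]

doc = [
    "das ist ein Satz und nicht mehr",
    "this is an English sentence about the web",
]
print(keep_german_paragraphs(doc))  # → ["das ist ein Satz und nicht mehr"]
```

In practice, a trained character n-gram classifier would replace the stopword lists, but the per-paragraph granularity is the point: document-level language ID cannot discard individual foreign-language passages.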


Similar resources

How to Train good Word Embeddings for Biomedical NLP

The quality of word embeddings depends on the input corpora, model architectures, and hyper-parameter settings. Using the state-of-the-art neural embedding tool word2vec and both intrinsic and extrinsic evaluations, we present a comprehensive study of how the quality of embeddings changes according to these features. Apart from identifying the most influential hyper-parameters, we also observe ...
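The intrinsic evaluations mentioned above typically compare cosine similarities of embedding vectors against human relatedness judgments. A minimal sketch with toy vectors (the words and 3-dimensional "embeddings" below are made-up for illustration; real word2vec vectors are learned and have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings for illustration only.
emb = {
    "protein": [0.9, 0.1, 0.0],
    "gene":    [0.8, 0.2, 0.1],
    "car":     [0.0, 0.9, 0.4],
}
# Related terms should score higher than unrelated ones.
print(cosine(emb["protein"], emb["gene"]))  # high
print(cosine(emb["protein"], emb["car"]))   # low
```

An intrinsic benchmark then correlates such similarity scores with human ratings, while an extrinsic one plugs the embeddings into a downstream task such as named-entity recognition.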


Automatic Parallel Corpora and Bilingual Terminology extraction from Parallel WebSites

Nowadays, the notion, importance, and significance of parallel corpora are so great that they need no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are big corpora about politics or legislation, about medicine and other specific areas, but we lack corpora for many other areas. Currently there is a huge investment on using the ...


Langid.py for Better Language Modelling

Large corpora are crucial resources for building many statistical language technology systems, and the Web is a readily available source of vast amounts of linguistic data from which to construct such corpora. Nevertheless, little research has considered how best to build corpora from the Web. In this study we consider the importance of language identification in Web corpus construction. Beginni...
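The langid.py tool referenced above uses a trained naive Bayes model over byte n-grams; as a rough stand-in to show the idea, here is a toy character-trigram profile classifier (the training snippets and scoring scheme are assumptions for this sketch, not langid.py's actual model):

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts, with padding so word edges are captured."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(samples):
    """samples: {language: training text} -> per-language trigram profiles."""
    return {lang: trigrams(text) for lang, text in samples.items()}

def identify(text, profiles):
    """Pick the language whose trigram profile overlaps most with the text."""
    grams = trigrams(text)
    def score(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(profiles, key=lambda lang: score(profiles[lang]))

profiles = train({
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
    "en": "the quick brown fox jumps over the lazy dog and the cat",
})
print(identify("die katze springt über den hund", profiles))  # → de
print(identify("the dog jumps over the cat", profiles))       # → en
```

Real systems train on much larger samples and use probabilistic scoring, but even this sketch shows why short or mixed-language pages are the hard cases in Web corpus construction.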


Towards a Surface Realization-Oriented Corpus Annotation

Until recently, deep stochastic surface realization has been hindered by the lack of semantically annotated corpora. This is about to change. Such corpora are increasingly available, e.g., in the context of CoNLL shared tasks. However, recent experiments with CoNLL 2009 corpora show that these popular resources, which serve well for other applications, may not do so for generation. The attempts...


Mining Chinese-English Parallel Corpora from the Web

Parallel corpora are a crucial resource in research fields such as cross-lingual information retrieval and statistical machine translation, but only a few parallel corpora with high quality are publicly available nowadays. In this paper, we try to solve the problem by developing a system that can automatically mine high quality parallel corpora from the World Wide Web. The system contains a thr...



Journal:

Volume:   Issue:

Pages:  -

Publication date: 2015